Quasireplicas and universal lengths of microbial genomes
Statistical analysis of distributions of occurrence frequencies of short words in 108 microbial
complete genomes reveals the existence of a set of universal "root-sequence lengths" shared by all 
microbial genomes. These lengths and their universality give powerful clues to the way microbial 
genomes are grown. We show that the observed genomic properties are explained by a model for 
genome growth in which primitive genomes grew mainly by maximally stochastic duplications of 
short segments from an initial length of about 200 nucleotides (nt) to a length of about one million nt 
typical of microbial genomes. The relevance of this result to the nature of simultaneous
random growth and information acquisition by genomes, to the so-called RNA world in which life
evolved before the rise of proteins and enzymes, and to several other topics is discussed.

PACS numbers: 87.14.Gg, 87.23.Kg, 89.70.+c, 89.75.-k, 02.50.-r, 05.65.+b



Genomes are books of life for organisms and necessarily carry huge amounts of
information. By and large bigger genomes carry more information than smaller ones
(there are noted exceptions). Yet as far as we know genomes grew and evolved
stochastically, modulated by natural selection [1]. This raises a puzzling question:
how do genomes grow stochastically and simultaneously accumulate information? This
paper uses the set of all 108 sequenced complete microbial genomes as data in exploring
ideas on randomness, entropy, information and growth with the aim of finding an answer.
What emerges is the discovery of a set of universal "root-sequence lengths", governed by
a simple exponential relation and shared by all the microbial genomes, and a model for
genome growth that reproduces observed genomic data and that serves as at least a
partial answer: genomes are "quasireplicas" grown mainly by maximally stochastic short
segmental duplications from random root-sequences of a universal length of about 200 nt.

In what follows we do some groundwork by first dis- 
cussing the relation between the relative spectral width 
of a distribution of occurrence frequencies for a set of 
random events and the size of the set, and the same for 
the set where all frequencies are multiplied by a factor, 
and then showing a simple relation between the spectral 
width and the Shannon information for such sets. We 
then examine certain statistical properties of complete 
microbial genomes, present the results and analyze them 
by way of our growth model. 

Consider a set of occurrence frequencies for $\tau$ types of events,
$\{f_i \,|\, \sum_{i=1}^{\tau} f_i = N\} \equiv \{f_i|N\}$, with mean frequency $\bar f$
and standard deviation (std) $\Delta = \langle (f - \bar f)^2 \rangle^{1/2}$. If each
frequency is increased by a factor of $m$ then $\bar f$ and std for the new set
$\{f'_i = m f_i \,|\, mN\}$, an $m$-multiple of $\{f_i|N\}$, will both increase by a
factor of $m$ so that the relative spectral width, $\sigma = \Delta/\bar f$, will not
change.

Suppose $\{f_i|N\}$ is a set of frequencies of random events of equal likelihood. Then
the event probability versus frequency will be nearly a Poisson distribution (provided
$N \gg \tau$) with std $\Delta_{ran} = (b \bar f)^{1/2}$, where $b < 1$ is a factor that
depends only on $\tau$ and approaches unity when $\tau$ is large. That $\sigma$ scales
as $N^{-1/2}$ for large $N$ is the basis for a well-known effect in thermodynamics: the
average of some measure of a random system gains sharpness as the system gains size,
and achieves infinite sharpness in the thermodynamical (large $N$) limit.

Let $\{f_i|N\}$ be the frequency set for a "small" system $S_s$ of random events,
$\{f''_i|mN\}$ be the set for a "large" system $S_L$ of random events, and
$\{f'_i = m f_i \,|\, mN\}$, the $m$-multiple of $S_s$, be the set for the system $S'$.
By definition both $S_s$ and $S_L$ are totally random while $S'$ is only partially so.
We have $m \bar f = \bar f' = \bar f''$ and $\sigma = \sigma' = m^{1/2} \sigma''$; $S'$
has the large size of $S_L$ and the large $\sigma$ of $S_s$. Compared to $S_s$, $S'$ has
the randomness of $S_s$, but repeated $m$ times. Compared to $S_L$, $S'$ is less random
and more ordered by possessing a periodicity.
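The scaling relations above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the sizes `tau`, `N` and `m` are arbitrary illustrative choices, and the value $b = 1 - 1/\tau$ is the standard multinomial result, assumed here because the text only states that $b < 1$ depends on $\tau$.

```python
import math
import random
from collections import Counter

def relative_spectral_width(freqs):
    """Return sigma = std / mean for a list of occurrence frequencies."""
    mean = sum(freqs) / len(freqs)
    var = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    return math.sqrt(var) / mean

def random_frequencies(n_events, n_types, rng):
    """Draw n_events equally likely events and count each of the n_types types."""
    counts = Counter(rng.randrange(n_types) for _ in range(n_events))
    return [counts.get(i, 0) for i in range(n_types)]

rng = random.Random(0)           # fixed seed, chosen arbitrarily
tau, N, m = 1024, 50_000, 16     # illustrative sizes, not taken from the paper

small = random_frequencies(N, tau, rng)        # S_s: small random system
large = random_frequencies(m * N, tau, rng)    # S_L: large random system
multiple = [m * f for f in small]              # S': the m-multiple of S_s

sigma_s = relative_spectral_width(small)
sigma_l = relative_spectral_width(large)
sigma_p = relative_spectral_width(multiple)

# (b / mean)^(1/2) with the assumed b = 1 - 1/tau and mean = N / tau
expected_s = math.sqrt((1 - 1 / tau) * tau / N)

print("sigma(S_s) vs expected:", sigma_s, expected_s)
print("sigma(S') / sigma(S_s):", sigma_p / sigma_s)
print("sigma(S_s) / sigma(S_L) vs sqrt(m):", sigma_s / sigma_l, math.sqrt(m))
```

The m-multiple keeps the width of the small system exactly, while the genuinely random large system is narrower by a factor close to $m^{1/2}$.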

Shannon expressed the information in a system in terms of decrease in uncertainty [2].
Shannon's uncertainty, or entropy, for the system $\{f_i|N\}$ is
$H = -\sum_i (f_i/N) \log(f_i/N)$. We define the information of the system as
$R = H_{max} - H$, which accords with the thermodynamical notion that an increase in
entropy results in a decrease in information. We are interested in cases when most of
the $f_i$'s are non-zero; then $H$ acquires its maximum value $\log \tau$ when all
$f_i = \bar f$ and one finds for a bell-shaped distribution

$$R \approx C \sigma^2 \qquad (1)$$

where the proportional coefficient $C$ is about 0.5; it is exactly 0.5 if $\{f_i|N\}$
has a Gaussian distribution. This reveals the close relation between information and
relative spectral width. For the three systems $S_s$, $S'$ and $S_L$ discussed above,
$R(S_s) = R(S') \approx C\sigma^2 \approx m R(S_L)$. That is, the partially ordered
system $S'$ has the size of the larger system $S_L$ and the higher information content
of the smaller system $S_s$. Summarizing, we have: (i) Given two totally random systems
the larger system carries less information; (ii) Given two systems of the same size the
one with the larger $\sigma$ has more order and carries more information. We should
note that since they are simply "periodic", $m$-multiples do not represent the best
route by which a system grows and acquires information. In general, a large $R$ or
$\sigma$ is a necessary but not sufficient condition for high information content. In
what follows, pending a verification of Eq. (1), we shall use the terms information and
relative spectral width interchangeably.
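The proportionality between $R$ and $\sigma^2$ can be motivated by a short expansion. This is a sketch under stated assumptions, not the authors' derivation: natural logarithms are used, and the fractional deviations $\epsilon_i$ (a symbol introduced here for illustration) are assumed small, i.e. the distribution is narrow.

```latex
% Write f_i = \bar f (1 + \epsilon_i), with \sum_i \epsilon_i = 0 and
% \tau^{-1} \sum_i \epsilon_i^2 = \sigma^2, so that f_i / N = (1 + \epsilon_i)/\tau.
\begin{align}
H &= -\sum_{i=1}^{\tau} \frac{1+\epsilon_i}{\tau}
      \left[ \log(1+\epsilon_i) - \log\tau \right] \\
  &= \log\tau - \frac{1}{\tau} \sum_{i=1}^{\tau} (1+\epsilon_i)\log(1+\epsilon_i) \\
  &\approx \log\tau - \frac{1}{\tau} \sum_{i=1}^{\tau}
      \left( \epsilon_i + \tfrac{1}{2}\epsilon_i^2 \right)
   = \log\tau - \tfrac{1}{2}\sigma^2 , \\
R &= H_{max} - H \approx \tfrac{1}{2}\sigma^2 .
\end{align}
```

To this order the coefficient is $C = 1/2$, in line with the value of about 0.5 quoted in the text; terms of order $\epsilon_i^3$ and higher are neglected.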

Consider now single strands of DNA, or nucleotide sequences, and view them as linear
texts written in the four bases, or letters, A, C, G, T. For a sequence of $L$
nucleotides (nt) we denote by $S_k$, or a $k$-distribution, the set of frequencies
$\{f_i|L\}_k$, where $f_i$ is the occurrence frequency of the $i$-th $k$-letter word, or
$k$-mer, that may be obtained by moving a sliding window of width $k$ across the genome;
$\tau = 4^k$ and $\bar f = 4^{-k} L$. To measure the information of a real genomic
sequence relative to that expected of a random sequence of the same length (and base
composition, see below) we define "relative spectral information" for bell-shaped
distributions to be:

$$M_\sigma = \sigma^2 \bar f / b \approx (\sigma/\sigma_{ran})^2 \qquad (2)$$

where $b$ is the factor associated with $\sigma_{ran}$. A random sequence is expected to
have $M_\sigma \approx 1$.
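As an illustration of the sliding-window count and of Eq. (2), here is a minimal sketch for an unbiased ($p = 0.5$) random sequence. The function names are hypothetical, the sequence length is an arbitrary choice, and the factor `b = 1 - 4**(-k)` is the plain multinomial value, assumed for this sketch only (the paper later adopts a different $b$ when A/T and C/G are grouped).

```python
import random
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k):
    """Count k-mers with a sliding window of width k; unseen k-mers count as 0."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get("".join(word), 0) for word in product("ACGT", repeat=k)]

def relative_spectral_information(seq, k):
    """M_sigma = sigma^2 * mean / b, following Eq. (2), for a p = 0.5 sequence."""
    freqs = kmer_frequencies(seq, k)
    mean = sum(freqs) / len(freqs)
    var = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    b = 1 - 4 ** (-k)            # assumption: multinomial value 1 - 1/tau, tau = 4^k
    return var / (mean * b)      # sigma^2 * mean equals var / mean

rng = random.Random(1)           # fixed seed, chosen arbitrarily
random_seq = "".join(rng.choice("ACGT") for _ in range(200_000))

for k in (3, 4, 5):
    print(k, round(relative_spectral_information(random_seq, k), 2))
```

For a random sequence the printed values should scatter around 1, with larger scatter at small $k$ where only a few distinct $k$-mers exist.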

Suppose $Q$, an $m$-replica, is obtained by replicating $m$ times a sequence $Q'$; then
every $k$-distribution $S_k$ of $Q$ is an $m$-multiple of the corresponding
$k$-distribution of $Q'$. In particular, if $Q'$ is a random sequence, then we expect
(when $m \ll L$) $M_\sigma(S_k) \approx m$ to a high degree of accuracy, independent of
$k$. This motivates the following: Given sequence $Q$ of length $L$ and $k$-distribution
$S_k$, $L_r = L/M_\sigma(S_k)$ is defined as the root-sequence length of $Q$ for
$k$-mers. As far as $k$-mers are concerned, $Q$ (not necessarily an $m$-replica) has the
same relative spectral information as that of a random "root-sequence" of length $L_r$.
Only an $m$-replica of a random sequence is expected to have $L_r$'s independent of $k$
for $k < \log L / \log 4$.
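The statement that an $m$-replica of a random sequence has $M_\sigma \approx m$, and hence $L_r$ close to the length of the replicated sequence, can be illustrated as follows. This is a sketch for $p = 0.5$ only; the helper names, the root length of 20 000 nt and $m = 10$ are illustrative assumptions, as is the multinomial factor `b`.

```python
import random
from collections import Counter
from itertools import product

def m_sigma(seq, k):
    """Relative spectral information of Eq. (2) for an unbiased (p = 0.5) sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    freqs = [counts.get("".join(word), 0) for word in product("ACGT", repeat=k)]
    mean = sum(freqs) / len(freqs)
    var = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    b = 1 - 4 ** (-k)            # assumed multinomial factor, not the paper's choice
    return var / (mean * b)

def root_sequence_length(seq, k):
    """L_r = L / M_sigma(S_k), the root-sequence length for k-mers."""
    return len(seq) / m_sigma(seq, k)

rng = random.Random(2)           # fixed seed, chosen arbitrarily
root = "".join(rng.choice("ACGT") for _ in range(20_000))
replica = root * 10              # a 10-replica of the random root sequence

for k in (3, 4, 5):
    print(k, round(m_sigma(replica, k), 1), round(root_sequence_length(replica, k)))
```

Up to statistical scatter, `m_sigma(replica, k)` should be close to 10 and `root_sequence_length(replica, k)` close to 20 000 for each $k$ shown, which is the $k$-independence expected of a true replica.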

The 108 complete microbial genomes currently in GenBank [?] are heterogeneous in
length - 0.4 to 7 million bases (Mb) - and base composition - 20 to 80% A+T. In most
cases the numbers of A's and T's (and of C's and G's) in a genome are very similar. We
therefore characterize the base composition of a genome by a single parameter $p$, the
combined probability of A and T, or C and G, whichever is greater.

In Fig. 1 the 5-distributions (green/gray curves) of random sequences 2 Mnt long with
$p$ equal to 0.5 (A), 0.6 (B) and 0.7 (C), respectively, are shown together with the per
2 Mnt 5-distributions (black) of three genomes with matching $p$ values: A. fulgidus [?]
(A), S. pneumoniae (B) and C. acetobutylicum [?] (C). Only the distributions in (A) for
the $p = 0.5$ sequences satisfy the bell-shape requirement to make Eq. (2) directly
applicable. In this case the random distribution is indeed Poissonian and the genomic
distribution is much wider, $\sigma_{A.ful.}/\sigma_{ran} \approx 20$, so that
$M_\sigma \approx 400$ and, information-wise, as far as 5-mers are concerned, a 2 Mnt
stretch of the A. fulgidus genome is like the 400-replica of a random sequence merely
5 knt long.

FIG. 1: Comparison of 5-distributions of genome (black), random (green) and model
(orange) sequences with p=0.5 (A), 0.6 (B) and 0.7 (C), respectively. The genomes are
A. fulgidus (A), S. pneumoniae (B) and C. acetobutylicum (C).

For random sequences with $p > 0.5$ the single Poisson distribution is split into
$k + 1$ smaller Poisson distributions (Fig. 1 (B) and (C)), one for each of the subsets
of $k$-mers with $m$ AT's (called $m$-sets), whose respective means are
$\bar f_m(p) = \bar f\, 2^k p^m (1-p)^{k-m}$, $m = 0$ to $k$. For the genomes the
distributions for the $m$-sets are so broadened that no individual peak is discernible.
Notably in cases when $p > 0.5$, the $\sigma$ for the whole distribution is mostly
determined by the spread of the $\bar f_m$'s, which gives
$\sigma \approx [2^k (p^2 + (1-p)^2)^k - 1]^{1/2}$, rather than by the information of
the sequence. We thus generalize the definition for $M_\sigma$ given in Eq. (2) to be
the weighted average over the relative spectral information of the $m$-sets:

$$M_\sigma = \sum_{m=0}^{k} L^{-1} \left( 2^k (k,m) \bar f_m \right) \sigma_{k,m}^2 \bar f_m / b \qquad (3)$$

where $(k,m)$ is a binomial, $\sum_m 2^k (k,m) \bar f_m = L$. Since A and T (and C and
G) are counted together and the number of each monomer in the sequence is fixed, the
binomial factor is taken to be $b = 1 - 2^{1-k}$. In practice, to circumvent large
fluctuations in $\sigma_{k,m}$ induced by small unevenness in the A/T (or C/G)
contents - this can occur when $\bar f_m$ is very large at $k = 2$ and 3 - each
frequency is divided by a factor $(2^k / p^m (1-p)^{k-m}) \prod_s p_s^{m_s}$, where $s$
runs over the four bases and $\sum_s m_s = k$. To verify Eq. (1), we also define a
"relative Shannon information" $M_R$ where $\sigma_{k,m}^2$ in Eq. (3) is replaced by
$R_{k,m}$.
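The two closed-form statements about $m$-sets, the normalization $\sum_m 2^k (k,m) \bar f_m = L$ and the width produced purely by the spread of the $m$-set means, can be checked directly. This is a minimal sketch with hypothetical function names; the values of `L`, `k` and `p` are illustrative, and each $k$-mer is placed exactly at its $m$-set mean, so no sequence statistics enter.

```python
from math import comb, sqrt

def mset_means(L, k, p):
    """Expected frequency of a k-mer containing m A/T's, for m = 0..k.

    Uses f_m = fbar * 2^k * p^m * (1 - p)^(k - m) with fbar = L / 4^k.
    """
    fbar = L / 4 ** k
    return [fbar * 2 ** k * p ** m * (1 - p) ** (k - m) for m in range(k + 1)]

def sigma_from_mset_spread(L, k, p):
    """Relative spectral width when every k-mer sits exactly at its m-set mean."""
    fbar = L / 4 ** k
    means = mset_means(L, k, p)
    sizes = [2 ** k * comb(k, m) for m in range(k + 1)]   # number of k-mers per m-set
    var = sum(n * (f - fbar) ** 2 for n, f in zip(sizes, means)) / 4 ** k
    return sqrt(var) / fbar

L, k, p = 2_000_000, 5, 0.7      # illustrative values
means = mset_means(L, k, p)
total = sum(2 ** k * comb(k, m) * f for m, f in enumerate(means))
closed_form = sqrt(2 ** k * (p ** 2 + (1 - p) ** 2) ** k - 1)

print("sum over m-sets:", total, "expected:", L)
print("sigma from spread:", sigma_from_mset_spread(L, k, p), "closed form:", closed_form)
```

At $p = 0.5$ all $m$-set means coincide and the spread contribution vanishes, which is why Eq. (2) can be applied directly only in that case.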

Fig. 2 shows log-log plots of $M_\sigma$ and $M_R$ versus sequence length $L$ computed
from a "genome" set composed of 108 complete microbial genomes and two control sets with
lengths and base compositions matching those in the genome set: a "random" set of random
sequences and a "replica" set of 100-replicas of random sequences. The results for the
control sets are shown in the left panel ((A) and (B)) of Fig. 2; they are essentially
independent of $k$, sequence length $L$ and base composition $p$ and have the expected
values: (A) $M_R = C = 0.514 \pm 0.062$ - this verifies Eq. (1); (B)
$M_\sigma = 1.02 \pm 0.11$ for the random set and $M_\sigma = 101 \pm 12$ for the
replica set. Each set contains 972 pieces of data and in each plot about 50% of the
error comes from data for $k = 2$ ("□" in the figure) and 25% from $k = 3$ ("△"). This
is because $\bar f_{k,m}$ for these cases are very large and magnify fluctuations in
$\sigma_{k,m}$ in Eq. (3).




FIG. 3: $L_r$ versus $k$ extracted from relative spectral information of the genome set
(squares) and a set of 108 model sequences (triangles) whose lengths and base
compositions match those in the genome set. Line shows mean of Eq. (4).

FIG. 2: The relative Shannon information ($M_R$) and relative spectral information
($M_\sigma$) for three sets of sequences. (A) $M_R$ for the random set; (B) $M_\sigma$
for the random and replica sets; (C) $M_\sigma$ for the genome set.

The right panel of Fig. 2 shows $M_\sigma$ for the genome set, where each piece of data
was multiplied by a factor of $2^{10-k}$ to delineate data into different $k$ groups for
better viewing. Still essentially $p$-independent, the data are otherwise entirely
different from those of the control sets: (i) For given $k$ they form a band (std is
about 50% of mean) that depends linearly on $L$, implying that
$L_r = L/\langle M_\sigma \rangle$ is a universal root-sequence length, i.e., the same
for all microbial genomes. (ii) For given $L$, the mean $\log M_\sigma$ decreases
approximately linearly with increasing $k$, such that the universal lengths (squares in
Fig. 3) satisfy an approximate exponential relation of the form

$$L_r(k) = A \times t^k; \quad 2 \le k \le 10 \qquad (4)$$

with $A = 48 \pm 24$ nt and $t = 2.53$. We remark that the property revealed by the
log-log plot in Fig. 2 (C) is different from Zipf's law [9] and its variants that have
been much discussed for $k$-mer frequencies ($k > 6$) in biological sequences [?].
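To make the numbers in Eq. (4) concrete, the following sketch tabulates the universal root-sequence lengths and the corresponding relative spectral information $L/L_r(k)$. It uses only the central values $A = 48$ nt and $t = 2.53$ (the quoted uncertainty on $A$ is ignored); the 1 Mnt genome length and the function name are illustrative assumptions.

```python
A_NT = 48.0    # prefactor A of Eq. (4) in nt, central value only
T_BASE = 2.53  # base t of Eq. (4)

def universal_root_length(k):
    """Root-sequence length L_r(k) = A * t^k, valid for 2 <= k <= 10."""
    if not 2 <= k <= 10:
        raise ValueError("Eq. (4) is stated only for 2 <= k <= 10")
    return A_NT * T_BASE ** k

genome_length = 1_000_000   # illustrative genome of 1 Mnt

for k in range(2, 11):
    l_r = universal_root_length(k)
    # L / L_r is the relative spectral information implied for this genome length
    print(f"k={k:2d}  L_r={l_r:10.0f} nt  L/L_r={genome_length / l_r:8.1f}")
```

For $k = 2$ this gives $L_r \approx 307$ nt, in line with the value of about 300 nt used later in the text, while for $k = 10$ the root length approaches the genome length itself.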

The universality of the $L_r$'s suggests the existence of a universal mechanism for
(microbial) genome growth from proto-genomes of a universal initial length. The very
large values of $M_\sigma$ (hence the shortness of the $L_r$'s) for the smaller $k$'s
imply a mechanism involving a great deal of replication or duplication. One obvious
mechanism, growth mainly by whole-genome replications [11, 12], is ruled out because
that would yield $k$-independent $L_r$'s, contrary to data. The observed strong
$k$-dependence of $L_r$ suggests a more complex duplication process.

We show that "universal genomes" generated in a simple and biologically plausible
growth model [13, 14] possess properties similar to those of microbial genomes. In the
model the initial condition of a genome is a random sequence about $L_0 \approx 200$ nt
long with a base composition characterized by a given value of $p$. The condition
$L_0 < L_r(2) \approx 300$ nt is necessary if the large values of $M_\sigma$ for the
small $k$'s are to be attained. The genome then grows by random short segmental
duplications - or quasireplication - possibly modulated by random single mutations. The
model shares some features with those used to explain the power-law behavior of the
occurrence frequency of genes in genomes [15, 16], except that there the units of
duplication are genes, not the short oligonucleotides used here. The quasireplication
process is maximally stochastic: a segment of length $l$, chosen according to the
probability density function $f(l) = \frac{1}{a\, n!} (l/a)^n e^{-l/a}$, is copied from
one site and inserted into another site, both randomly selected. The Erlang function,
the integer $n = 2$ and the length scale $a = 6.7$ (nt) were determined by data,
implying a typical length of $20 \pm 12$ nt for the duplicated segments.
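The growth rule just described can be sketched in a few lines. This is an illustrative reading of the model, not the authors' code: rounding the drawn length to an integer of at least 1, trimming the final sequence to the target length, the function names, and the omission of point mutations are all assumptions of this sketch. The Erlang density with $n = 2$ is sampled as a gamma distribution with shape $n + 1$ and scale $a$.

```python
import random

def random_sequence(length, p, rng):
    """Random root sequence as a list of bases, with combined A+T probability p."""
    weights = [p / 2, (1 - p) / 2, (1 - p) / 2, p / 2]   # order A, C, G, T
    return rng.choices("ACGT", weights=weights, k=length)

def segment_length(rng, n=2, a=6.7):
    """Draw a length from f(l) = (l/a)^n exp(-l/a) / (a n!), rounded to >= 1 nt.

    The density is a gamma density with shape n + 1 and scale a.
    """
    return max(1, round(rng.gammavariate(n + 1, a)))

def quasireplicate(target_length, initial_length=200, p=0.5, seed=0):
    """Grow a sequence by random short segmental duplications (no mutations)."""
    rng = random.Random(seed)
    genome = random_sequence(initial_length, p, rng)
    while len(genome) < target_length:
        seg_len = min(segment_length(rng), len(genome))
        start = rng.randrange(len(genome) - seg_len + 1)   # random copy site
        segment = genome[start:start + seg_len]
        insert_at = rng.randrange(len(genome) + 1)         # random insertion site
        genome[insert_at:insert_at] = segment
    return "".join(genome[:target_length])                 # trim any overshoot

model_sequence = quasireplicate(20_000)
print(len(model_sequence), model_sequence[:60])
```

List insertion makes each step cost time proportional to the current length, so this form is only practical for short illustrative sequences; growing to about 1 Mnt would call for a more efficient data structure.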

In Fig. 3 the $L_r$'s (triangles) extracted from a set of 108 model sequences with
length and base composition matching those in the genome set and generated in silico by
quasireplication are compared with the $L_r$'s for the genome set (squares). The two
sets of lengths essentially agree although those from the model sequences have a
slightly weaker $k$-dependence. The 5-distributions computed from three representative
model sequences with $p = 0.5$, 0.6 and 0.7, respectively, are shown as orange curves
in Fig. 1. These results show that our very simple growth model, in being able to
produce sequences that faithfully exhibit signature global properties of microbial
genomes, likely has captured the essence of the real genomic growth process. We believe
that without the two main ingredients of our model, very short initial genome length
(< 200 nt) and random duplication of short segments, no simple growth model can produce
the results shown in Fig. 3.

We call sequences generated in our model quasireplicas. Unlike an $m$-replica, a
quasireplica is globally aperiodic. If the length of a quasireplica is $L$, then the
maximum $k$ for which Eq. (4) applies is given by $k_{max} \approx \log L / \log 4$.
For $k < k_{max}$ a quasireplica acts as an $m$-replica where $m = L/L_r(k)$, while for
$k \gg k_{max}$ it appears essentially as a random sequence. Quasireplicas are
partially ordered, highly complex and evidently capable of carrying large amounts of
information. Our study shows that microbial genomes are self-organized quasireplicas
belonging to the class given by $A \approx 48$ nt and $t \approx 2.53$ of Eq. (4). It
is a common feature of complex self-organized systems to exhibit power-law relations
[17] and the relation between our model and Eq. (4) is being explored.

Quasireplicas are extremely robust. We have verified 
that, provided the typical duplicated segment length is 
significantly greater than k max , quasireplication (includ- 
ing simple replication) of a quasireplica begets a longer 
quasireplica of the same class. We are currently studying 
eukaryotic genomes in terms of quasireplicas and results 
will be reported elsewhere. Based on findings obtained 
so far we make the following proposition: the ancestors 
of microbial genomes underwent a fundamental transi- 
tion in their growth and evolution shortly after they had 
reached a length of about 200 nt, by which time they had 
acquired a rudimentary duplication machinery, and then 
grew (and diverged) mainly by short-segment quasirepli- 
cation to become quasireplicas of the order of 1 Mnt long. 
Assuming this proposition to be substantially true we 
mention some implications. 

It seems that by adopting the natural method of quasireplication for growth, microbial
genomes also adopted a superb strategy for information acquisition and accumulation,
and in doing so they left a clear evolutionary track: the universal root-sequence
lengths. This seems to put the onset of quasireplication in the position of being
another, hitherto unknown, major transition in evolution [18]. Before that transition
the ancestral genomes and their prebiotic precursors somehow acquired a store of
information - including a rudimentary duplication machinery - and after the transition
the genomes evolved via quasireplication. The smallness of the genomes at the
transition - less than a quarter of the size of a present-day gene coding for a typical
enzyme - implies that at the time they must have lived in an "RNA world" devoid of
proteins [19, 20]. It is likely that the ribozymes that made up the duplication
machinery at that time were not much bigger than the smallest ribozymes now extant,
about 31 to 50 nt [21]; hence the average duplicated segment length of about 20 nt
would have been effective in propagating coded information which, presumably, could
later be locally varied under the combined forces of mutation and natural selection for
adaptation to new purposes. An RNA world reigned no more than 600 million years, from
about 4.2 (when the earth cooled down) to 3.6 (when protein must have appeared) billion
years ago - probably even much shorter [20]. It is not necessary that genomes grew to
their respective current lengths during that period. It is sufficient that during that
period, growth by short segmental quasireplication produced basic quasireplicas which,
after the rise of proteins and enzymes, could be further expanded upon via
quasireplication by duplicating longer segments, including genes [15, 16]. Whatever
path the genomes actually took, their rate of evolution must have been tremendous
during the RNA era, and quasireplication probably had a better chance of meeting that
challenge than any other alternative.